arXiv:1503.03613v1 [stat.ML] 12 Mar 2015


On the Impossibility of Learning the Missing Mass 


Elchanan Mossel mossel@gmail.com 

Department of Statistics, The Wharton School, University of Pennsylvania 
Departments of Statistics and Computer Science, University of California, Berkeley 

Mesrob I. Ohannessian mesrob@gmail.com

University of California, San Diego 


Abstract 

This paper shows that one cannot learn the probability of rare events without imposing further structural
assumptions. The event of interest is that of obtaining an outcome outside the coverage of an
i.i.d. sample from a discrete distribution. The probability of this event is referred to as the “missing 
mass”. The impossibility result can then be stated as: the missing mass is not distribution-free 
PAC-learnable in relative error. The proof is semi-constructive and relies on a coupling argument 
using a dithered geometric distribution. This result formalizes the folklore that in order to predict 
rare events, one necessarily needs distributions with “heavy tails”. 
1. Introduction

Given data consisting of $n$ i.i.d. samples $X_1, \dots, X_n$ from an unknown distribution $p$ over the
integers $\mathbb{N}_+$, we traditionally compute the empirical distribution:

$$\hat{p}_n(x) := \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}\{X_i = x\}.$$

To estimate the probability $p(E)$ of an event $E \subset \mathbb{N}_+$, we could use $\hat{p}_n(E)$. This works well
for abundantly represented events, but not as well for rare events. An unequivocally rare event is
the set of symbols that are missing in the data,

$$E_n := \{x \in \mathbb{N}_+ : \hat{p}_n(x) = 0\}.$$

The probability of this (random) event is denoted by the missing mass:

$$M_n(X_1, \dots, X_n) := p(E_n) = \sum_{x \in \mathbb{N}_+} p(x)\, \mathbf{1}\{\hat{p}_n(x) = 0\}.$$


The question we strive to answer in this paper is: “Can we learn the missing mass when p is an 
arbitrary distribution on N+?” Definition 1 phrases this precisely in the PAC-learning framework. 
Definition 1 An estimator is a sequence of functions $\hat{M}_n(x_1, \dots, x_n) : \mathbb{N}_+^n \to [0,1]$. We say that
an estimator PAC-learns the missing mass in relative error with respect to a family $\mathcal{P}$ of distributions,
if for every $p \in \mathcal{P}$ and every $\varepsilon, \delta > 0$ there exists $n_0(p, \varepsilon, \delta)$ such that for all $n > n_0(p, \varepsilon, \delta)$:

$$P_p\left\{ \left| \frac{\hat{M}_n(X_1, \dots, X_n)}{M_n(X_1, \dots, X_n)} - 1 \right| < \varepsilon \right\} > 1 - \delta.$$

The learning is said to be distribution-free if $\mathcal{P}$ consists of all distributions on $\mathbb{N}_+$.

Our question thus becomes: Can we distribution-free PAC-learn the missing mass in relative
error? It is obvious that the empirical estimator $\hat{p}_n(E_n)$ gives us the trivial answer of 0, and cannot
learn the missing mass. A popular alternative is the Good-Turing estimator of the missing mass,
which is the fraction of singletons in the data:

$$G_n := \sum_{x \in \mathbb{N}_+} \frac{1}{n} \mathbf{1}\{n \hat{p}_n(x) = 1\}.$$


The Good-Turing estimator has many interpretations. Its original derivation by Good (1953) 
uses an empirical-Bayes perspective. It can also be thought of as a leave-one-out cross-validation 
estimator, which contributes to the missing set if and only if the holdout appears exactly once in the 
data. Fundamentally, $G_n$ derives its form and its various properties from the simple fact that:

$$\mathbb{E}[G_n] = \sum_{x \in \mathbb{N}_+} p(x)(1 - p(x))^{n-1} = \mathbb{E}[M_{n-1}].$$
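The identity above can be checked exactly on a small example. The following is a minimal sketch, not part of the original analysis: the three-symbol distribution `p`, the sample size `n = 4`, and the helper names are hypothetical choices made only for illustration. It enumerates all samples, so it is only practical for tiny supports.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def good_turing(sample):
    """Good-Turing estimate: number of singletons divided by the sample size."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return Fraction(singletons, len(sample))


def missing_mass(sample, p):
    """True missing mass: total probability of the symbols absent from the sample."""
    seen = set(sample)
    return sum(mass for x, mass in p.items() if x not in seen)


def expectation(stat, p, n):
    """Exact expectation of stat(sample) over all i.i.d. samples of size n."""
    total = Fraction(0)
    for sample in product(p, repeat=n):
        weight = Fraction(1)
        for x in sample:
            weight *= p[x]
        total += weight * stat(sample)
    return total


# Hypothetical toy distribution on three symbols (illustration only).
p = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}
n = 4
lhs = expectation(good_turing, p, n)                       # E[G_n]
rhs = expectation(lambda s: missing_mass(s, p), p, n - 1)  # E[M_{n-1}]
closed_form = sum(q * (1 - q) ** (n - 1) for q in p.values())
```

Using exact rational arithmetic avoids any floating-point tolerance when comparing the three quantities.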




A study of $G_n$ in the PAC-learning framework was first undertaken by McAllester and Schapire
(2000) and continued later by McAllester and Ortiz (2003). Some further refinement and insight
was also given later by Berend and Kontorovich (2013). These works focused on additive error.
Ohannessian and Dahleh (2012) shifted the attention to relative error, establishing the PAC-learning
property of the Good-Turing estimator with respect to the family of heavy-tailed (roughly
power-law) distributions, e.g. $p(x) \propto x^{-1/\alpha}$ with $\alpha \in (0, 1)$. This work also showed that Good-Turing
fails to learn the missing mass for geometric distributions, and therefore does not achieve
distribution-free learning. More recently, Ben-Hamou et al. (2014) provide a comprehensive and 
tight set of concentration inequalities, which can be interpreted in the current PAC framework, and
which further demonstrate that Good-Turing can PAC-learn with respect to heavier-than-geometric 
light tails, e.g. the family that includes $p(x) \propto 2^{-x^{\alpha}}$ with $\alpha \in (0,1)$ in addition to power-laws.

These results leave open the important question of whether there exists some other estimator 
that can PAC-learn the missing mass in relative error in a distribution-free fashion (i.e. for any
distribution p). Our main contribution is to prove that there are no such estimators. 

The first insight to glean from this impossibility result is that one is justified to use further
structural assumptions when learning about rare events. Furthermore, the proof relies on an implicit 
construction that uses a dithered geometric distribution. In doing so, it shows that the failure of the 
Good-Turing estimator for light-tailed distributions is not a weakness of the procedure, but is rather 
due to a fundamental barrier. Conversely, the success of Good-Turing for heavier-than-geometric 
and power laws shows its universality, in some restricted sense. In particular, in concrete support to 
folklore (e.g. Taleb, 2008), we can state that for estimating probabilities of rare events, heavy tails 
are both necessary and sufficient. 
The paper is organized as follows. In Section 2, we present our main result, with a detailed
exposition of the proof. In Section 3 we give an immediate extension to continuous tail estimation,
show that parametric light-tailed learning is possible, comment further on the Good-Turing
estimator, and concisely place this result in the context of a chief motivating application, that of 
computational linguistics. Lastly, we conclude in Section 4 with a summary and open questions. 

Notation 

We use the shorthand $M_n := M_n(X_1, \dots, X_n)$ for the missing mass and $\hat{M}_n := \hat{M}_n(X_1, \dots, X_n)$
for its estimator, keeping implicit their dependence on the samples and, in the case of $M_n$, on the
distribution $p$.

2. Main Result

Our main result is stated as follows. The rest of this section is dedicated to its detailed proof. 

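Theorem 2 There exists a positive $\varepsilon > 0$ and a strictly increasing sequence $(n_k)_{k=1,2,\dots}$, such that
for every estimator $\hat{M}_n$ there exists a distribution $p^*$, such that for all $k$:

$$P_{p^*}\left\{ \left| \frac{\hat{M}_{n_k}}{M_{n_k}} - 1 \right| > \varepsilon \right\} > \varepsilon.$$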

In particular, it follows that it is impossible to perform distribution-free PAC-learning of the missing 
mass in relative error. 

Remark 3 Our proof below implies the statement of the theorem with $\varepsilon = 10^{-4}$ and $n_k = 6.5 \cdot 2^k$,
but we did not make an honest effort to optimize these parameters.

2.1. Proof Outline 

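Consider the family of $\beta$-dithered geometric($\tfrac{1}{2}$) distributions, where the mass of each outcome
beyond a value $m$ of a geometric($\tfrac{1}{2}$) random variable is divided between two sub-values, with a
fraction $\beta$ in one and $1 - \beta$ in the other. More precisely:

Definition 4 The $\beta$-dithered geometric family $\mathcal{P}_{\beta,m}$ is a collection of distributions parametrized by
the dithering choices $\theta \in \{\beta, 1-\beta\}^{\mathbb{N}_+}$, with $\beta \in (0, \tfrac{1}{2})$, as follows:

$$\mathcal{P}_{\beta,m} = \Big\{ p_\theta : \; p_\theta(x) = \frac{1}{2^x}, \; x = 1, \dots, m; \quad p_\theta(m + 2j - 1) = \frac{\theta_j}{2^{m+j}}, \; p_\theta(m + 2j) = \frac{1 - \theta_j}{2^{m+j}}, \; j \in \mathbb{N}_+, \; \theta \in \{\beta, 1-\beta\}^{\mathbb{N}_+} \Big\}. \tag{2}$$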
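To make Definition 4 concrete, the following sketch evaluates the probability mass function of a dithered geometric distribution. It is an illustration under stated assumptions: the function name `dithered_pmf`, the dictionary encoding of $\theta$, and the convention that unspecified indices default to $\beta$ are hypothetical choices, not notation from the paper.

```python
from fractions import Fraction


def dithered_pmf(x, theta, beta=Fraction(1, 4), m=1):
    """Mass p_theta(x) of a beta-dithered geometric(1/2) distribution.

    `theta` maps j = 1, 2, ... to the dithering choice theta_j, which should be
    beta or 1 - beta. Indices that are not supplied default to beta; this is an
    assumption made only to keep the sketch finite.
    """
    if x < 1:
        return Fraction(0)
    if x <= m:
        return Fraction(1, 2 ** x)           # undithered geometric prefix
    j = (x - m + 1) // 2                     # x is m + 2j - 1 or m + 2j
    theta_j = theta.get(j, beta)
    share = theta_j if (x - m) % 2 == 1 else 1 - theta_j
    return share / 2 ** (m + j)


# Example: theta_1 = 3/4, theta_2 = 1/4, all later choices equal to beta.
theta = {1: Fraction(3, 4), 2: Fraction(1, 4)}
pair_mass = dithered_pmf(2, theta) + dithered_pmf(3, theta)  # mass of pair j = 1
```

Each pair $\{m+2j-1, m+2j\}$ carries total mass $2^{-(m+j)}$ regardless of $\theta_j$, which is why all members of the family look like the same geometric distribution at the level of pairs.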

The intuition of the proof of Theorem 2 is that within such light-tailed families, two distributions 
may have very similar samples and thus estimated values, yet have significantly different true values 
of the missing mass. This follows the general methodology of many statistical lower bounds. We 
now state the outline of the proof. We choose a subsequence of the form $n_k = C 2^k$. We set
$\beta = 1/4$, $m = 1$, and $C = 6.5$. The value of $\varepsilon > 0$ is made explicit in the proof, and depends only
on these choices. We proceed by induction.
• We show that there exists $\theta^*_1$ such that for all $\theta$ with $\theta_1 = \theta^*_1$ we have for $n = n_1$:

$$P_\theta\left\{ \left| \frac{\hat{M}_n}{M_n} - 1 \right| > \varepsilon \right\} > \varepsilon. \tag{3}$$

• Then, at every step $k > 1$:

(H) We start with $(\theta^*_1, \dots, \theta^*_{k-1})$ such that for all $\theta$ with $(\theta_1, \dots, \theta_{k-1}) = (\theta^*_1, \dots, \theta^*_{k-1})$,
Inequality (3) holds for $n = n_1, \dots, n_{k-1}$.

(*) We then show that it must be that for at least one of $\tilde\theta = \beta$ or $\tilde\theta = 1 - \beta$, for all $\theta$ with
$(\theta_1, \dots, \theta_k) = (\theta^*_1, \dots, \theta^*_{k-1}, \tilde\theta)$, Inequality (3) holds additionally for $n = n_k$. We
select $\theta^*_k$ to be the corresponding $\tilde\theta$.

• This induction produces an infinite sequence $\theta^* \in \{\beta, 1-\beta\}^{\mathbb{N}_+}$, and the desired distribution
in Theorem 2 can be chosen as $p^* = p_{\theta^*}$, since it is readily seen to satisfy the claim for each
$n_k$, by construction.


2.2. Proof Details 

We skip the proof of the base case, since it is mostly identical to that of the induction step. Therefore, 
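in what follows we are given $(\theta^*_1, \dots, \theta^*_{k-1})$ by hypothesis (H), and we would like to prove that the
selection in (*) can always be done. Let us denote the two choices of parameters by

$$\theta := (\theta^*_1, \dots, \theta^*_{k-1}, \beta, \theta_{k+1}, \dots)$$

and

$$\theta' := (\theta^*_1, \dots, \theta^*_{k-1}, 1 - \beta, \theta'_{k+1}, \dots),$$

and let us refer to $(\theta_{k+1}, \dots)$ and $(\theta'_{k+1}, \dots)$ as the trailing parameters. What we show in the
remainder of the proof is that with two arbitrary sets of trailing parameters, we cannot have two
simultaneous violations of Inequality (3) (for both $\theta$ and $\theta'$). That is, we cannot have both:

$$P_\theta\left\{ \left| \frac{\hat{M}_{n_k}}{M_{n_k}} - 1 \right| > \varepsilon \right\} < \varepsilon \quad \text{and} \quad P_{\theta'}\left\{ \left| \frac{\hat{M}_{n_k}}{M_{n_k}} - 1 \right| > \varepsilon \right\} < \varepsilon. \tag{4}$$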


This is shown in Lemma 8, in the last portion of this section. To see why this is sufficient to
show that the selection in (*) can be done, consider first the case that Inequality (3) is upheld for
both $\theta$ and $\theta'$ with any two sets of trailing parameters. In this case we can arbitrarily choose $\theta^*_k$ to be
either $\beta$ or $1 - \beta$, since the induction step is satisfied. We can therefore focus on the case in which
this fails. That is, for either $\theta$ or $\theta'$ a choice of trailing parameters can be made such that Inequality
(3) with $n = n_k$ is not satisfied, and therefore one of the two cases in (4) holds [say, for example, for
$\theta$]. Fix the corresponding trailing parameters [in this example, $(\theta_{k+1}, \dots)$]. Then, for any choice
of the other set of trailing parameters [in this example, $(\theta'_{k+1}, \dots)$], Lemma 8 precludes a violation
of Inequality (3) for $n = n_k$ by the other choice [in this example, $\theta'$]. Therefore this choice can be
selected for $\theta^*_k$ [in this example, $\theta^*_k = 1 - \beta$].

By using the coupling device and restricting ourselves to a pivotal event, we formalize the 
aforementioned intuition that the estimator may not distinguish between two separated missing 
mass values, and deduce that both statements in (4) cannot hold simultaneously. 
Coupling

Definition 5 A coupling between two distributions $p$ and $p'$ on $\mathbb{N}_+$ is a joint distribution $q$ on $\mathbb{N}_+^2$
such that the first and second marginal distributions of $q$ revert back to $p$ and $p'$ respectively.


Couplings are useful because probabilities of events on each side may be evaluated on the joint 
probability space, while forcing events of interest to occur in an orchestrated fashion. Going back 
to our induction step and the specific choices $\theta$ and $\theta'$ with arbitrary trailing parameters, we perform
the following coupling: 


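$$q(x, x') = \begin{cases}
p_\theta(x) = p_{\theta'}(x') & \text{if } x = x' < m + 2k - 1; \\
\beta / 2^{m+k} & \text{if } x = x' = m + 2k - 1, \text{ or if } x = x' = m + 2k; \\
(1 - 2\beta) / 2^{m+k} & \text{if } x = m + 2k, \; x' = m + 2k - 1; \\
2^{m+k} \, p_\theta(x) \, p_{\theta'}(x') & \text{if } x, x' > m + 2k; \\
0 & \text{otherwise.}
\end{cases} \tag{5}$$

In the fourth case, the factor $2^{m+k}$ normalizes the product by the total tail mass $\sum_{x > m+2k} p_\theta(x) = 2^{-(m+k)}$, so that the marginals are preserved.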


It is easy to verify that $q$ in Equation (5) is a coupling between $p_\theta$ and $p_{\theta'}$ as in Definition 5. Note
the resulting outcomes. If $X, X'$ are generated according to $q$, then if either is in $\{1, \dots, m + 2k - 2\}$
then both values are identical. If either is in $\{m + 2k + 1, \dots\}$ then so is the other, but otherwise
the two values are conditionally independent. If either is in $\{m + 2k - 1, m + 2k\}$, so is the other,
and the conditional probability is given by:

$$\begin{array}{c|cc}
x \backslash x' & m + 2k - 1 & m + 2k \\ \hline
m + 2k - 1 & \beta & 0 \\
m + 2k & 1 - 2\beta & \beta
\end{array}$$
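The marginal property of the coupling can be verified mechanically on the head of the support. The sketch below is illustrative only: it assumes $\theta_k = \beta$ and $\theta'_k = 1 - \beta$ as in the induction step, normalizes the independent tail by the tail mass $2^{-(m+k)}$, and uses hypothetical names (`pmf`, `coupling`) and a tuple encoding of the dithering choices.

```python
from fractions import Fraction

BETA = Fraction(1, 4)   # dithering fraction, as chosen in the proof
M = 1                   # length of the undithered prefix


def pmf(x, theta):
    """p_theta(x); `theta` is a tuple (theta_1, ..., theta_J) of dithering choices."""
    if x <= M:
        return Fraction(1, 2 ** x)
    j = (x - M + 1) // 2
    share = theta[j - 1] if (x - M) % 2 == 1 else 1 - theta[j - 1]
    return share / 2 ** (M + j)


def coupling(x, xp, theta, theta_p, k):
    """q(x, x') of Equation (5), assuming theta_k = BETA and theta_p_k = 1 - BETA."""
    lo, hi = M + 2 * k - 1, M + 2 * k
    scale = Fraction(1, 2 ** (M + k))
    if x == xp and x < lo:
        return pmf(x, theta)                  # shared prefix: forced equality
    if x == xp and x in (lo, hi):
        return BETA * scale
    if x == hi and xp == lo:
        return (1 - 2 * BETA) * scale
    if x > hi and xp > hi:
        # independent tail, normalized by the tail mass 2^-(M+k)
        return pmf(x, theta) * pmf(xp, theta_p) / scale
    return Fraction(0)


k = 2
theta = (Fraction(3, 4), BETA, BETA)            # hypothetical choices
theta_p = (Fraction(3, 4), 1 - BETA, 1 - BETA)  # differs from theta at index k
hi = M + 2 * k
row_sums = [sum(coupling(x, xp, theta, theta_p, k) for xp in range(1, hi + 1))
            for x in range(1, hi + 1)]
col_sums = [sum(coupling(x, xp, theta, theta_p, k) for x in range(1, hi + 1))
            for xp in range(1, hi + 1)]
```

For $x \le m + 2k$, the coupling puts no mass on pairs with the other coordinate beyond $m + 2k$, so the finite row and column sums above are the full marginals on that range.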


Now consider coupled data $(X_i, X'_i)_{i=1,\dots,n}$ generated as i.i.d. samples from $q$. It follows that,
marginally, the $X$-sequence is i.i.d. from $p_\theta$, and so is the $X'$-sequence from $p_{\theta'}$. Any event $B$ that
is exclusively $X$-measurable or $B'$ that is exclusively $X'$-measurable has the same probability under
the coupled measure. That is,

$$P_{p_\theta}(B) = P_q(B) := q^n(B \times \mathbb{N}_+^n)$$

and

$$P_{p_{\theta'}}(B') = P_q(B') := q^n(\mathbb{N}_+^n \times B').$$

In what follows we work only with coupled data, and use simply the shorthand $P$ to mean $P_q$.
Pivotal Event 

The event we would like to work under is that of the coupled samples being identical, while exactly
covering the range $1, \dots, m + 2k - 1$:

$$A_k = \bigcap_{i=1}^{n_k} \{X_i = X'_i\} \cap \Big\{ \{X_1, \dots, X_{n_k}\} = \{1, \dots, m + 2k - 1\} \Big\}. \tag{6}$$

The reason $A_k$ interests us is that it encapsulates the aforementioned intuition.
Lemma 6 Under event $A_k$, the coupled missing masses are distinctly separated,

$$\frac{M_{n_k}}{M'_{n_k}} = \frac{2 - \beta}{1 + \beta},$$

while any estimator cannot distinguish the coupled samples,

$$\hat{M}_{n_k} = \hat{M}'_{n_k}.$$


Proof The confusion of any estimator is simply due to the fact that under $A_k$, the coupling forces
all samples to be identical, $X_i = X'_i$ for all $i = 1, \dots, n_k$. Thus $\hat{M}_{n_k} = \hat{M}'_{n_k}$, since estimators
only depend on the samples and not the probabilities.

The missing masses, on the other hand, do depend on both the samples and the probabilities and
thus they differ. But the event $A_k$ makes the set of missing symbols simply the tail $m + 2k, m + 2k + 1, \dots$,
so we can compute the missing masses exactly:

$$M_{n_k} = p_\theta(m + 2k) + \sum_{j=k+1}^{\infty} \frac{1}{2^{m+j}} = \frac{1 - \theta_k}{2^{m+k}} + \frac{1}{2^{m+k}} = (2 - \beta) 2^{-m-k}$$

and

$$M'_{n_k} = p_{\theta'}(m + 2k) + \sum_{j=k+1}^{\infty} \frac{1}{2^{m+j}} = \frac{1 - \theta'_k}{2^{m+k}} + \frac{1}{2^{m+k}} = (1 + \beta) 2^{-m-k},$$

and the claim follows.
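As a quick sanity check of the separation in Lemma 6, the following sketch evaluates the two missing masses in closed form for $\beta = 1/4$ and $m = 1$. The function name is a hypothetical helper, and the closed form uses the fact that the pairs beyond index $k$ have total mass $2^{-(m+k)}$.

```python
from fractions import Fraction


def missing_mass_under_pivot(theta_k, k, m=1):
    """Missing mass when exactly the symbols {1, ..., m + 2k - 1} have been observed.

    The unseen symbols are m + 2k, with mass (1 - theta_k) / 2^(m+k), and the
    whole tail beyond it, whose pairs have total mass
    sum_{j > k} 2^-(m+j) = 2^-(m+k).
    """
    scale = Fraction(1, 2 ** (m + k))
    return (1 - theta_k) * scale + scale


beta = Fraction(1, 4)
ratios = [
    missing_mass_under_pivot(beta, k) / missing_mass_under_pivot(1 - beta, k)
    for k in range(1, 6)
]
```

The ratio does not depend on $k$: for $\beta = 1/4$ it equals $(2 - \beta)/(1 + \beta) = 7/5$ at every step.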


We now show that $A_k$ always has a positive probability, bounded away from zero.

Lemma 7 For $\beta = 1/4$, $m = 1$, $C = 6.5$ and $n_k = C 2^k$, there exists a positive absolute constant
$\eta > 0$ such that for all $k$, $P(A_k) > \eta$. We can explicitly set $\eta = 2 \cdot 10^{-4}$.

Proof Note that $A_k$ in Equation (6) overspecifies the event. In fact, only forcing the exact coverage
of $1, \dots, m + 2k - 1$ is sufficient, since this implies in turn that the coupled samples are identical.
This is evident for values in $1, \dots, m + 2k - 2$. But since $m + 2k$ is not allowed in this event, it also
holds for the value $m + 2k - 1$. We can then write $A_k = A_{k,1} \cap A_{k,2}$, dividing the exact coverage
into the localization in the range and the representation of each value by at least one sample:

$$A_{k,1} = \Big\{ \bigcup_{i=1}^{n_k} \{X_i\} \subseteq \{1, \dots, m + 2k - 1\} \Big\} \quad \text{(localization)},$$

$$A_{k,2} = \Big\{ \bigcup_{i=1}^{n_k} \{X_i\} \supseteq \{1, \dots, m + 2k - 1\} \Big\} \quad \text{(representation)}.$$

Let $\alpha$ be the probability of $(x, x')$ being in $\{(1,1), \dots, (m + 2k - 1, m + 2k - 1)\}$. From the
coupling in Equation (5) and the structure of the dithered family in Equation (2), we see that for values
up to $m + 2k - 2$ this probability sums up to the $m + k - 1$ first terms of a geometric($\tfrac{1}{2}$), and for
$(m + 2k - 1, m + 2k - 1)$ the coupling assigns it $\beta / 2^{m+k}$, thus:

$$\alpha = \sum_{x=1}^{m+2k-1} q(x, x) = 1 - \frac{1}{2^{m+k-1}} + \frac{\beta}{2^{m+k}}.$$

We can then explicitly compute:

$$P(A_{k,1}) = \alpha^{n_k} = \left(1 - \frac{2 - \beta}{2^{m+k}}\right)^{n_k} =: \eta_1(k).$$
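Meanwhile, note that conditionally on $A_{k,1}$, the occurrence probabilities on $\{(1,1), \dots, (m + 2k - 1, m + 2k - 1)\}$
are simply normalized by $\alpha$. By using a union bound on the complement of
$A_{k,2}$ (the event of at least one of these values not appearing), we then have that:

$$\begin{aligned}
P(A_{k,2} \mid A_{k,1}) &\ge 1 - \sum_{x=1}^{m+2k-1} \big[1 - q(x,x)/\alpha\big]^{n_k} \\
&\ge 1 - \sum_{x=1}^{m} \left(1 - \frac{1}{2^x}\right)^{n_k} - \sum_{j=1}^{k-1} \left[ \left(1 - \frac{\beta}{2^{m+j}}\right)^{n_k} + \left(1 - \frac{1-\beta}{2^{m+j}}\right)^{n_k} \right] - \left(1 - \frac{\beta}{2^{m+k}}\right)^{n_k} \\
&\ge 1 - \sum_{x=1}^{m} \left(1 - \frac{1}{2^x}\right)^{n_k} - 2 \sum_{j=1}^{k-1} \left(1 - \frac{\beta}{2^{m+j}}\right)^{n_k} - \left(1 - \frac{\beta}{2^{m+k}}\right)^{n_k} =: \eta_2(k),
\end{aligned}$$

where the second line uses $\alpha \le 1$ and the third uses $\beta \le 1 - \beta$.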


Therefore,

$$P(A_k) = P(A_{k,1} \cap A_{k,2}) = P(A_{k,1}) P(A_{k,2} \mid A_{k,1}) \ge \eta_1(k) \eta_2(k) \ge \inf_{k \ge 1} \eta_1(k) \eta_2(k) =: \eta.$$

We now use our choices of $\beta = 1/4$, $m = 1$, $C = 6.5$, and $n_k = C 2^k$, to bound this worst-case
$\eta$. In particular, we can verify that $\eta > 2 \cdot 10^{-4}$, and it follows as claimed that the pivotal event always has
a probability bounded away from zero. ■
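The numerical claim at the end of this proof can be checked directly. The sketch below is a hedged illustration: it takes $\eta_1(k) = (1 - (2-\beta)/2^{m+k})^{n_k}$ and the union-bound expression $\eta_2(k) = 1 - \sum_{x \le m} (1 - 2^{-x})^{n_k} - 2\sum_{j<k} (1 - \beta/2^{m+j})^{n_k} - (1 - \beta/2^{m+k})^{n_k}$ as the quantities to evaluate, scans a finite range of $k$ (an assumption; the limit $k \to \infty$ is not proved by the code), and uses the hypothetical name `eta_bounds`.

```python
def eta_bounds(k, beta=0.25, m=1, C=6.5):
    """Return (eta_1(k), eta_2(k)) for the proof's choices of beta, m and C."""
    n_k = C * 2 ** k
    alpha = 1 - (2 - beta) / 2 ** (m + k)          # mass of the diagonal head
    eta1 = alpha ** n_k                            # localization probability
    # Union bound on some value in {1, ..., m + 2k - 1} never being sampled.
    miss = sum((1 - 2.0 ** -x) ** n_k for x in range(1, m + 1))
    miss += 2 * sum((1 - beta / 2 ** (m + j)) ** n_k for j in range(1, k))
    miss += (1 - beta / 2 ** (m + k)) ** n_k
    eta2 = 1 - miss                                # representation bound
    return eta1, eta2


products = {k: e1 * e2 for k, (e1, e2) in ((k, eta_bounds(k)) for k in range(1, 31))}
worst = min(products.values())
```

Over this range the product stays above the stated constant $2 \cdot 10^{-4}$, with the smallest values occurring at small $k$.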


Induction Step 

We now combine all the elements presented thus far to complete the proof of Theorem 2 by establishing
the following claim, which we have shown in the beginning of the detailed proof section to
be sufficient for the validity of the induction step. In particular, we restate Equation (4) under the 
coupling of Equation (5). 

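Lemma 8 Let

$$\theta := (\theta^*_1, \dots, \theta^*_{k-1}, \beta, \theta_{k+1}, \dots), \quad \text{and} \quad \theta' := (\theta^*_1, \dots, \theta^*_{k-1}, 1 - \beta, \theta'_{k+1}, \dots),$$

with arbitrary trailing parameters $(\theta_{k+1}, \dots)$ and $(\theta'_{k+1}, \dots)$. Let $q$ be the coupling of Equation
(5), and let $B_k = \big\{ |\hat{M}_{n_k}/M_{n_k} - 1| > \varepsilon \big\}$ and $B'_k = \big\{ |\hat{M}'_{n_k}/M'_{n_k} - 1| > \varepsilon \big\}$. Then given our
choices of $\beta = 1/4$, $m = 1$, $C = 6.5$ and $n_k = C 2^k$, if $\varepsilon < 10^{-4}$ we cannot simultaneously have

$$P_q(B_k) < \varepsilon \quad \text{and} \quad P_q(B'_k) < \varepsilon.$$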


Proof Note that this choice of $\varepsilon$ means that $\varepsilon < \eta/2$, where $\eta$ is as in Lemma 7. Recall the pivotal
event $A_k$, and assume, for the sake of contradiction, that both probability bounds $P(B_k) < \varepsilon$ and
$P(B'_k) < \varepsilon$ hold. Note that if the complement $B_k^c$ holds, it means that

$$\hat{M}_{n_k} / M_{n_k} \in [1 - \varepsilon, 1 + \varepsilon], \tag{7}$$

and similarly if $B'^c_k$ holds, it means that

$$\hat{M}'_{n_k} / M'_{n_k} \in [1 - \varepsilon, 1 + \varepsilon]. \tag{8}$$


By making our hypothesis, we are asserting that these events have high probabilities, $1 - \varepsilon$,
under both $p_\theta$ and $p_{\theta'}$ distributions, and that thus the estimator is effectively $(1 \pm \varepsilon)$-close to the true
value of the missing mass. Yet, we know that this would be violated under the pivotal event, which 
occurs with positive probability. We now formalize this contradiction. 

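By Lemma 7, we have that:

$$P(B_k \mid A_k) = \frac{P(A_k \cap B_k)}{P(A_k)} \le \frac{P(B_k)}{P(A_k)} < \frac{\varepsilon}{\eta}, \qquad P(B'_k \mid A_k) = \frac{P(A_k \cap B'_k)}{P(A_k)} \le \frac{P(B'_k)}{P(A_k)} < \frac{\varepsilon}{\eta},$$

and therefore, by a union bound,

$$P(B_k^c \cap B'^c_k \mid A_k) \ge 1 - 2\frac{\varepsilon}{\eta} > 0, \tag{9}$$

where the last inequality is strict, by the choice of $\varepsilon < \eta/2$.

On the other hand, recall that by Lemma 6 under $A_k$ we have:

$$\hat{M}_{n_k} = \hat{M}'_{n_k} \quad \text{and} \quad \frac{M_{n_k}}{M'_{n_k}} = \frac{2 - \beta}{1 + \beta} = \frac{7}{5}.$$

By combining this with Equations (7) and (8), we can now see that if $\frac{1 + \varepsilon}{1 - \varepsilon} < \frac{7}{5}$, which is satisfied
by any choice of $\varepsilon < 1/6$, in particular ours, then if $B_k^c$ occurs, then $B'_k$ occurs, and conversely if
$B'^c_k$ occurs then $B_k$ occurs. For example, say $B'^c_k$ occurs, then $\hat{M}'_{n_k}/M'_{n_k} \le 1 + \varepsilon$, and so

$$\frac{\hat{M}_{n_k}}{M_{n_k}} = \frac{\hat{M}'_{n_k}}{M'_{n_k}} \cdot \frac{M'_{n_k}}{M_{n_k}} \le \frac{5}{7}(1 + \varepsilon) < 1 - \varepsilon,$$

implying that Equation (7) is not satisfied, thus $B_k$ occurs. The end result is that under event $A_k$,
$B_k^c$ and $B'^c_k$ cannot occur at the same time, and thus:

$$P(B_k^c \cap B'^c_k \mid A_k) = 0.$$

This contradicts the bound in (9), and establishes the lemma.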


3. Discussions 


3.1. Generalization to continuous tails 


A closely related problem to learning the missing mass is that of estimating the tail of a probability
distribution. In the simplest setting, the data consists of $Y_1, \dots, Y_n$ that are i.i.d. samples from a
continuous distribution on $\mathbb{R}$. Let $F$ be the cumulative distribution function. The task in question is
that of estimating the tail probability

$$W_n := 1 - F\Big( \max_{i=1,\dots,n} Y_i \Big),$$

that is the probability that a new sample exceeds the maximum of all samples seen in the data.
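For a known $F$, the target quantity $W_n$ is straightforward to evaluate; the difficulty discussed in this section is estimating it without knowing $F$. The following small sketch only illustrates the definition. The exponential distribution and the function names are assumptions chosen for the example.

```python
import math


def tail_probability(samples, cdf):
    """W_n = 1 - F(max_i Y_i): chance that a fresh draw exceeds the sample maximum."""
    return 1.0 - cdf(max(samples))


def exponential_cdf(y, rate=1.0):
    """CDF of an exponential distribution, used here only as an example of F."""
    return 1.0 - math.exp(-rate * y) if y > 0 else 0.0


w = tail_probability([0.3, 2.0, 1.1], exponential_cdf)  # equals exp(-2.0) up to rounding
```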

One can immediately see the similarity with the missing mass problem, as both problems concern
estimating probabilities of underrepresented events. We can use essentially the same PAC-learning
framework given by Definition 1, and prove a completely parallel impossibility result.
Theorem 9 For every estimator $\hat{W}_n$ of $W_n$ there exists a distribution $F^*$, a positive value $\varepsilon > 0$,
and a subsequence $(n_k)_{k=1,2,\dots}$, such that for all $k$:

$$P_{F^*}\left\{ \left| \frac{\hat{W}_{n_k}}{W_{n_k}} - 1 \right| > \varepsilon \right\} > \varepsilon.$$

In particular, it follows that it is impossible to perform distribution-free PAC-learning of the tail
probability in relative error.

Proof [Sketch] Recall that in the proof of Theorem 2, the pivotal event forced the missing mass to 
be a tail probability. Therefore, most of the arguments go through unchanged. Instead of dithering
a geometric distribution, we dither an exponential distribution, by shifting the mass in adjacent
blocks. Some of the adjustments that need to be performed concern the exact location of the samples 
within each block, but coarse bounds can be given by taking the extremities of each block instead. ■ 

Theorem 9 gives a concrete justification of why it is important to make regularity assumptions
when extrapolating distribution tails. This is of course the common practice of extreme
value theory (see, for example, Beirlant et al., 2004). Some impossibility results concerning the
even more challenging problem of estimating the density of the maximum were already known, 
(Beirlant and Devroye, 1999), but to the best of our knowledge this is the first result asserting it for 
tail probability estimation as well. 

3.2. Learning in various families 

Ben-Hamou et al. (2014) (Corollary 5.3) give a very clean characterization of a sufficient learnable
family, which encompasses the one covered by Ohannessian and Dahleh (2012).

Theorem 10 (Ben-Hamou et al. (2014)) Let $\mathcal{H}$ be the family of distributions on $\mathbb{N}_+$ that satisfy

$$\mathbb{E}\Big[ \sum_{x \in \mathbb{N}_+} \mathbf{1}\{n \hat{p}_n(x) = 1\} \Big] = \sum_{x \in \mathbb{N}_+} n p(x) [1 - p(x)]^{n-1} \to \infty.$$

The Good-Turing estimator PAC-learns the missing mass in relative error with respect to $\mathcal{H}$.

Note that this theorem in the cited paper asks for an additional technical condition, but this can
be relaxed. The proof relies on power moment concentration inequalities (such as Chebyshev's).
For us, this is instructive because one could readily verify that the condition of Theorem 10 fails
for geometric (and dithered geometric) distributions. We can thus see that in some sense Good-Turing
captures a maximal family of learnable distributions. In particular, we now know that the
complement of $\mathcal{H}$ is not learnable.

Considering how sparse the dithered geometric family is, the failure of any estimator to learn 
the missing mass with respect to it may seem discouraging. (Note that Theorem 2 holds even if the 
estimator is aware that this is the class it is paired with.) However, if we restrict ourselves to smooth 
parametric families within the light tails then the outlook can be brighter. We illustrate this with the 
case of the geometric family. 
Theorem 11 Let $\mathcal{G}$ be the class of geometric distributions, parametrized by $\alpha \in (0,1)$:

$$p_\alpha(x) = (1 - \alpha) \alpha^{x-1}, \quad \text{for } x \in \mathbb{N}_+.$$

Let $\hat\alpha_n = 1 - n / \sum_{i=1}^{n} X_i$ be the empirical estimator of the parameter, and define the plug-in estimator:

$$\hat{M}_n = \sum_{x \in \mathbb{N}_+} (1 - \hat\alpha_n) \hat\alpha_n^{x-1} \mathbf{1}\{n \hat{p}_n(x) = 0\}.$$

Then $\hat{M}_n$ PAC-learns the missing mass in relative error with respect to $\mathcal{G}$.

Proof [Sketch] The proof consists of pushing forward the convergence of the parameter to that of 
the entire distribution using continuity arguments, and then specializing to the missing mass. The 
details can be found in the appendix. ■ 
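The plug-in estimator of Theorem 11 is simple to compute. The sketch below is an illustrative implementation under stated assumptions: it takes the parameter estimate to be one minus the reciprocal of the sample mean, sums the fitted geometric mass over the unseen symbols up to the sample maximum, and adds the fitted tail mass beyond the maximum in closed form. The function name is hypothetical.

```python
from fractions import Fraction


def plug_in_missing_mass(sample):
    """Plug-in missing mass estimate under a fitted geometric distribution."""
    n = len(sample)
    alpha_hat = 1 - Fraction(n, sum(sample))   # 1 - 1 / (sample mean)
    seen, top = set(sample), max(sample)

    def fitted_mass(x):
        return (1 - alpha_hat) * alpha_hat ** (x - 1)

    # Unseen symbols below the maximum ("punctures" in the coverage).
    gaps = sum(fitted_mass(x) for x in range(1, top) if x not in seen)
    # Everything above the maximum: sum_{x > top} (1 - a) a^(x-1) = a^top.
    tail = alpha_hat ** top
    return gaps + tail


estimate = plug_in_missing_mass([1, 1, 2, 3, 3, 3, 5])
```

In this example the sample mean is $18/7$, so the fitted parameter is $11/18$; the unseen symbols are $4$ and everything above $5$.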


3.3. N-gram models and Bayesian perspectives

One of the prominent applications of estimating the missing mass has been to computational linguistics.
In that context, it is known as smoothing and is used to estimate N-gram transition probabilities.
The importance of accurately estimating the missing mass, and in particular in a relative-error
sense, comes from the fact that N-grams are used to score test sentences using log-likelihoods.
Test sentences often have transitions that are never seen in the training corpus, and thus in order for
the inferred log-likelihoods to accurately track the true log-likelihood, these rare transitions need to
be assigned meaningful values, ideally as close to the truth as possible. As such, various forms of
smoothing, including Good-Turing estimation, have become an essential ingredient of many practical
algorithms, such as the popular method proposed by Kneser and Ney (1995).

In the context of N-gram learning, a separate Bayesian perspective was also proposed. Among
the earliest to introduce this were MacKay and Peto (1995), using a Dirichlet prior. This was shown
to not be very effective, and we now understand that it is due to the fact that (1) the Dirichlet process
produces light tails while language is often heavy-tailed and, even if language were light-tailed, (2) rare
probabilities are hard to learn for large light-tailed families. The natural progression of these Bayesian models led to
the use of the two-parameter Poisson-Dirichlet prior (Pitman and Yor, 1997), which was suggested
initially by Teh (2006). It is worth remarking that a significant part of the contribution of these
Bayesian models, beyond modeling the missing mass, was to introduce formal hierarchies, which
is outside our scope. Concerning the missing mass, however, this line of work soon remarked that
the inference techniques closely followed the Good-Turing estimator, albeit being computationally 
much more demanding. In light of the present work, this is not surprising since the two-parameter 
Poisson-Dirichlet process almost surely produces heavy-tailed distributions, and any two algorithms 
that learn the missing mass are bound to have the same qualitative behavior. 

4. Summary 

In this paper, we have considered the problem of learning the missing mass, which is the probability 
of all unseen symbols in an i.i.d. draw from an unknown discrete distribution. We have phrased this 
in the probabilistic framework of PAC-learning. Our main contribution was to show that it is not
possible to learn the missing mass in a completely distribution-free fashion. 
In other words, no single estimator can do well for all distributions. We have given a detailed
account of the proof, emphasizing the intuition of how failure can occur in large light-tailed families.
We have also placed this work in a greater context, through some discussions and extensions
of the impossibility result to continuous tail probability estimation, and by showing that smaller, 
parametric, light-tailed families may be learnable. 

An initial impetus for this paper and its core message is that assuming further structure can be 
necessary in order to learn rare events. Further structure, of course, is nothing more than a form of 
regularization. This is a familiar notion to the computational learning community, but for a long time 
the Good-Turing estimator enjoyed favorable analysis that focused on additive error, and evaded this 
kind of treatment. The essential ill-posedness of the problem was uncovered by studying relative 
error. But lower bounds cannot be deduced from the failure of particular algorithms. Our result thus 
completes the story, and we can now shift our attention to studying the landscape that is revealed. 

The most basic set of open problems concerns establishing families that allow PAC-learning 
of the missing mass. We have seen in this paper some such families, including the heavy-tailed 
family learnable by the Good-Turing estimator, and simple smooth parametric families, learnable 
using plug-in estimators. How do we characterize such families more generally? The next layer of 
questions concerns establishing convergence rates, via both lower and upper bounds. The fact that 
a family of distributions allows learning does not mean that such rates can be established. This is 
because any estimator may be faced with arbitrarily slow convergence, by varying the distribution 
in the family. In other words we may be faced with a lack of uniformity. How do we control the 
convergence rate? Lastly, when learning is not possible, we may want to establish how gracefully 
an estimator can be made to fail. Understanding these limitations and accounting for them can be 
critical to the proper handling of data-scarce learning problems. 
Appendix A. Proof of Theorem 11

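(Notation and outline) Let us first set some notation. Recall that the mean of the geometric
distribution $p_\alpha(x) = (1 - \alpha)\alpha^{x-1}$ is $\mu = \frac{1}{1-\alpha}$ and its variance is $\sigma^2 = \frac{\alpha}{(1-\alpha)^2}$. Let us write the
empirical mean and our parameter estimate respectively as follows:

$$\hat\mu_n := \frac{1}{n} \sum_{i=1}^{n} X_i \quad \text{and} \quad \hat\alpha_n := 1 - \frac{1}{\hat\mu_n}.$$

The plug-in probability estimate can be expressed as:

$$\tilde{p}_n(x) := (1 - \hat\alpha_n) \hat\alpha_n^{x-1}.$$

Using our notation for the missing symbols, $E_n := \{x \in \mathbb{N}_+ : \hat{p}_n(x) = 0\}$, the missing mass is

$$M_n = p_\alpha(E_n) = \sum_{x \in E_n} (1 - \alpha) \alpha^{x-1},$$

and the suggested plug-in estimator can be written as

$$\hat{M}_n := \tilde{p}_n(E_n) = \sum_{x \in E_n} (1 - \hat\alpha_n) \hat\alpha_n^{x-1}.$$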

The following proof first establishes the convergence of the parameter estimate and then pushes 
it forward to the entire distribution, specializing in particular to the missing mass. For the latter,
we establish some basic localization properties of the punctured segment of a geometric sample
coverage. This is related to the general study of gaps (see, for example, Louchard and Prodinger,
2008).

We have the following elementary convergence property for the parameter. 
Lemma 12 (Parameter Convergence) Let $\delta > 0$, and define:


• — 


6n 


\ 1 


Then, at every n > j, we have that with probability greater than 1 — 5: 


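$$\left| \frac{\hat\alpha_n}{\alpha} - 1 \right| < \varepsilon_n \quad \text{and} \quad \left| \frac{1 - \hat\alpha_n}{1 - \alpha} - 1 \right| < \varepsilon_n.$$

If we let $\eta_n = \varepsilon_n / (1 - \varepsilon_n)$, we can also write this as

$$\frac{1}{1 + \eta_n} < \frac{\hat\alpha_n}{\alpha} < 1 + \eta_n \quad \text{and} \quad \frac{1}{1 + \eta_n} < \frac{1 - \hat\alpha_n}{1 - \alpha} < 1 + \eta_n.$$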

Proof From Chebyshev’s inequality, we know that for all 5 > 0: 


We now simply have to verify that \jln — iA ^ implies that both | ^ — l| and 
eed, using 

(fin ~ 


smaller than e^. Indeed, using fin > p — 


1 Otri 


1—a 


- 1 


are 




_ 


a 


and 


1 - 



1 


finip-l) 


- 1 


- 1 


(fin - P) 


1 


fin(p-l) 


— \fin P\ 


1 


^-1 

Pn 


(p- fin) ^ 


< \ fin- P\ 


1 




Finally, since \fin — p\ < both of these bounds are smaller than: 

1 


a 


'/Sn’ 

1 


\/a 

1 — 0 . 


which is equal to $\varepsilon_n$. The expression with $\eta_n$ follows from $1 - \varepsilon_n = \frac{1}{1 + \eta_n}$ and $1 + \eta_n > 1 + \varepsilon_n$.


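It follows from Lemma 12 that with probability greater than $1 - \delta$, we have the following
pointwise convergence of the plug-in distribution $\tilde{p}_n(x) = (1 - \hat\alpha_n)\hat\alpha_n^{x-1}$:

$$(1 + \eta_n)^{-x} (1 - \alpha) \alpha^{x-1} < \tilde{p}_n(x) < (1 + \eta_n)^{x} (1 - \alpha) \alpha^{x-1}.$$

Since the rate of this convergence is not uniform, we need to exercise care when specializing to
particular events. We focus on the missing symbols' event. We have:

$$\frac{\sum_{x \in E_n} (1 + \eta_n)^{-x} (1 - \alpha) \alpha^{x-1}}{\sum_{x \in E_n} (1 - \alpha) \alpha^{x-1}} \le \frac{\hat{M}_n}{M_n} = \frac{\tilde{p}_n(E_n)}{p_\alpha(E_n)} \le \frac{\sum_{x \in E_n} (1 + \eta_n)^{x} (1 - \alpha) \alpha^{x-1}}{\sum_{x \in E_n} (1 - \alpha) \alpha^{x-1}}. \tag{10}$$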

The event En is inconvenient to sum over, because it has points spread out randomly. This is 
particularly true for its initial portion, where the samples “puncture” it. It is more convenient to
approximate this segment in order to bound Equation (10). We now formalize this notion, via the 
following definition. 
Definition 13 (Punctured Segment) The punctured segment of a sample is the part between the
end of the first contiguous coverage and the end of the total coverage. Its extremities are:

$$V_n^- := \min E_n \quad \text{and} \quad V_n^+ := \max E_n^c.$$
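The two extremities are easy to read off a sample: the lower one is the first positive integer that was not observed, and the upper one is the sample maximum. A minimal sketch follows; the function name is a hypothetical helper, and the sample is assumed to consist of positive integers.

```python
def punctured_segment(sample):
    """Return (V_minus, V_plus) for a sample of positive integers.

    V_minus is the smallest unseen symbol (end of the first contiguous
    coverage, plus one); V_plus is the largest seen symbol.
    """
    seen = set(sample)
    v_minus = 1
    while v_minus in seen:
        v_minus += 1
    v_plus = max(seen)
    return v_minus, v_plus


extremities = punctured_segment([1, 1, 2, 3, 3, 3, 5])  # symbols 4 and 6, 7, ... are unseen
```

When the coverage has no gaps, for instance for the sample `[1, 2, 3]`, the first unseen symbol lies just above the maximum and the punctured segment is empty.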

We have the following localization property for the punctured segment of samples from a geometric
distribution.


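Lemma 14 (Localization of Punctured Segment) Let $X_1, \dots, X_n$ be samples from a geometric
distribution $p_\alpha(x) = (1 - \alpha)\alpha^{x-1}$ on $\mathbb{N}_+$. Let $V_n^-$ and $V_n^+$ be the extremities of the punctured
segment as defined in Definition 13. Then, for all $u > \left(\frac{\alpha}{1-\alpha}\right)^2$ we have:

$$P\{V_n^- < \log_{1/\alpha}(n) - \log_{1/\alpha}(u)\} < 2 e^{-\frac{1-\alpha}{\alpha} u} < \frac{\alpha}{(1-\alpha) u},$$

$$P\{V_n^+ > \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}(u)\} < \frac{1}{u}.$$

In particular, for $\delta < (1 - \alpha)/\alpha^2$, we have that with probability greater than $1 - \delta$:

$$\log_{1/\alpha}(n) - \log_{1/\alpha}\left(\frac{1}{(1-\alpha)\delta}\right) \le V_n^- \quad \text{and} \quad V_n^+ \le \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}\left(\frac{1}{(1-\alpha)\delta}\right).$$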
Proof 

Given an integer a G N+, the event that V~ < a implies that one of the values below a did not 
appear in the sample. By using the union bound, we thus have that: 


P{V^n < a} < 1 . 

(1 — 


, x—ll 


< 

< 


£=l 

E oo 

^ ^exp —[l — a)na' 


a—l—£ 


By specializing to a{u, n) 


logl/a(ra) + l-logi/o(u) 


P{L„ < logi/^(n) - logi/„(?r)} < 

< 


P{Vn <a{u,n)} 

[-(1 - 

E OO ff 

^ ^ exp —{l — a)a~^u 


-l 


Lastly, if rt > (p^)^, one can show by induction that (1 — a)a + 1—1. This turns 

the sum into a geometric series, giving: 

P{Pn“ < logi/„(n) - logi/^(M)} < e" Yl7=i ^ 2e- 

Next, note that Vf~ is nothing but the maximum of the samples. Thus, given an integer b G N+, 
the event Vf' > bis the complement of the event that all the samples are at b or below. Since the 
total probability of the range 1, • • • , 5 is 1 — a^, we thus have: 

P{F+ > 6 } = 1 - (1 - 
If we now specialize to $b(u,n) = \lceil \log_{1/\alpha}(n) + \log_{1/\alpha}(u) \rceil$, we have that:

$$P\{V_n^+ > \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}(u)\} \le P\{V_n^+ > b(u,n)\} \le 1 - \left(1 - \alpha^{\log_{1/\alpha}(n)+\log_{1/\alpha}(u)}\right)^n = 1 - \left(1 - \frac{1}{u \cdot n}\right)^n \le \frac{1}{u}.$$

For the last part of the claim, we let $u = \frac{1}{(1-\alpha)\delta}$, followed by a union bound on the analyzed
events. This gives us that at least one of the two events holds with probability at most
$\frac{1}{u} + \frac{\alpha}{(1-\alpha)u} = \delta$; therefore neither holds with probability at least $1 - \delta$, as desired. ■
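The localization in Lemma 14 can be checked empirically. The following is a minimal simulation sketch, not part of the original proof: it draws geometric samples, computes the first missing value ($V_n^-$) and the sample maximum ($V_n^+$), and reports how often both fall inside the window $\log_{1/\alpha}(n) \mp \log_{1/\alpha}\frac{1}{(1-\alpha)\delta}$ (with the extra $+1$ on the upper side). The function names, the seed, and the parameter values in the usage note are illustrative assumptions.

```python
import math
import random

def sample_geometric(alpha, rng):
    """Draw X on {1, 2, ...} with P(X = x) = (1 - alpha) * alpha**(x - 1)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], which avoids log(0)
    return 1 + int(math.log(u) / math.log(alpha))

def punctured_extremities(sample):
    """Return (V_minus, V_plus): smallest missing value and largest observed value."""
    seen = set(sample)
    v_minus = 1
    while v_minus in seen:
        v_minus += 1
    return v_minus, max(seen)

def localization_frequency(alpha, n, delta, trials, seed=0):
    """Fraction of trials in which both outer bounds of the lemma hold."""
    rng = random.Random(seed)

    def log_base(t):
        # logarithm in base 1/alpha
        return math.log(t) / math.log(1.0 / alpha)

    margin = log_base(1.0 / ((1.0 - alpha) * delta))
    lower = log_base(n) - margin
    upper = log_base(n) + 1 + margin
    hits = 0
    for _ in range(trials):
        sample = [sample_geometric(alpha, rng) for _ in range(n)]
        v_minus, v_plus = punctured_extremities(sample)
        if lower <= v_minus and v_plus <= upper:
            hits += 1
    return hits / trials
```

For instance, `localization_frequency(0.5, 2000, 0.1, 200)` should return a value of at least $1 - \delta = 0.9$ if the lemma holds; since the union bound is conservative, the observed frequency is typically higher.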


(Completing the proof) We now put together the pieces of the proof of Theorem 11. To show
that our estimator PAC-learns the missing mass in relative error with respect to the geometric family, we establish the
following equivalent statement. Fix $\delta > 0$ and $\eta > 0$. We prove that for $n$ large enough, with
probability greater than $1 - 2\delta$, we have:

$$\frac{1}{1+\eta} < \frac{\hat M_n}{M_n} < 1 + \eta.$$


Without loss of generality, to satisfy the conditions of Lemmas 12 and 14, we restrict ourselves
to $\delta < (1-\alpha)/\alpha^2$ (we can always choose a smaller $\delta$ than specified) and to $n$ large enough for Lemma 12 to apply (we can always ask
for $n$ to be larger). As such, we have that with probability at least $1 - 2\delta$, both events of Lemmas
12 and 14 occur. We work under the intersection of these events.

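We give the details of only the right tail of the convergence; all the steps can be directly paralleled
for the left tail. To see why the punctured segment is a useful notion, we claim that the following
quantity upper bounds the right tail of Equation (10):

$$\frac{\sum_{x > V_n^+}(1+\eta_n)^x(1-\alpha)\alpha^{x-1}}{\sum_{x > V_n^+}(1-\alpha)\alpha^{x-1}} = (1+\eta_n)^{V_n^+}\,\frac{\sum_{y \in \mathbb{N}_+}(1+\eta_n)^y(1-\alpha)\alpha^{y-1}}{\sum_{y \in \mathbb{N}_+}(1-\alpha)\alpha^{y-1}} = (1+\eta_n)^{V_n^+}\,\frac{(1-\alpha)(1+\eta_n)}{1-\alpha(1+\eta_n)}, \qquad (11)$$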


where for the first equality we have used the change of variable $y = x - V_n^+$ and simplified the
common $\alpha$ factors in the numerator and denominator, and for the second equality we have used the
moment generating function of the geometric distribution: $E[e^{sX}] = (1-\alpha)e^s/(1-\alpha e^s)$. To prove
this claim, we proceed by induction, starting at step $t = 1$ with the set $G^{(0)} := \{V_n^+ + 1, V_n^+ + 2, \cdots\} \subset E_n$,
adding at every step $t$ the largest element $z^{(t)}$ of $E_n$ not yet in $G^{(t-1)}$ to obtain $G^{(t)}$,
and proving that:

$$\frac{\sum_{x \in G^{(t)}}(1+\eta_n)^x(1-\alpha)\alpha^{x-1}}{\sum_{x \in G^{(t)}}(1-\alpha)\alpha^{x-1}} \le \frac{\sum_{x \in G^{(t-1)}}(1+\eta_n)^x(1-\alpha)\alpha^{x-1}}{\sum_{x \in G^{(t-1)}}(1-\alpha)\alpha^{x-1}}.$$
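The closed form on the right of Equation (11), $(1+\eta_n)^{V_n^+}(1-\alpha)(1+\eta_n)/(1-\alpha(1+\eta_n))$, can be sanity-checked against a truncated version of the ratio of sums over $x > V_n^+$. This is a minimal numerical sketch under the assumption $\alpha(1+\eta_n) < 1$ (needed for the series to converge); the function names and the truncation length are illustrative choices.

```python
import math

def tail_ratio_numeric(alpha, eta, v_plus, terms=2000):
    """Truncated ratio of sums over x = v_plus + 1, ..., v_plus + terms."""
    r = alpha * (1 + eta)
    num = 0.0
    den = 0.0
    for x in range(v_plus + 1, v_plus + 1 + terms):
        # (1 + eta)^x (1 - alpha) alpha^(x - 1) = (1 - alpha)(1 + eta) r^(x - 1)
        num += (1 - alpha) * (1 + eta) * r ** (x - 1)
        den += (1 - alpha) * alpha ** (x - 1)
    return num / den

def tail_ratio_closed_form(alpha, eta, v_plus):
    """Closed form obtained from the geometric moment generating function."""
    r = alpha * (1 + eta)
    if r >= 1:
        raise ValueError("need alpha * (1 + eta) < 1 for the series to converge")
    return (1 + eta) ** v_plus * (1 - alpha) * (1 + eta) / (1 - r)
```

Comparing the two functions with `math.isclose` for a few values of `alpha`, `eta`, and `v_plus` confirms that the change of variable and the moment generating function step agree numerically.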

We use the following basic property: for positive real numbers $a_1, b_1, a_2, b_2 > 0$, the following
three inequalities are equivalent:

$$\frac{a_1}{b_1} < \frac{a_2}{b_2}, \qquad \frac{a_1}{b_1} < \frac{a_1 + a_2}{b_1 + b_2}, \qquad \frac{a_1 + a_2}{b_1 + b_2} < \frac{a_2}{b_2}.$$
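This equivalence is the classical mediant property: the fraction $(a_1 + a_2)/(b_1 + b_2)$ always lies between $a_1/b_1$ and $a_2/b_2$. A small illustrative check with exact rational arithmetic, where the helper name is an assumption made for this sketch:

```python
from fractions import Fraction

def mediant_inequalities(a1, b1, a2, b2):
    """Evaluate the three inequalities for positive a1, b1, a2, b2.

    Returns a tuple of three booleans; the property says they always agree.
    """
    a1, b1, a2, b2 = (Fraction(v) for v in (a1, b1, a2, b2))
    left = a1 / b1
    right = a2 / b2
    mediant = (a1 + a2) / (b1 + b2)
    return (left < right, left < mediant, mediant < right)
```

For example, `mediant_inequalities(1, 2, 3, 4)` compares $1/2$, $2/3$, and $3/4$, and all three inequalities hold; swapping the two fractions makes all three fail together.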
For the base case, let $a_2 = \sum_{x \in G^{(0)}}(1+\eta_n)^x(1-\alpha)\alpha^{x-1}$ and $b_2 = \sum_{x \in G^{(0)}}(1-\alpha)\alpha^{x-1}$.
We then choose the largest $z^{(1)} \in E_n \setminus G^{(0)}$, and we let $a_1 = (1+\eta_n)^{z^{(1)}}(1-\alpha)\alpha^{z^{(1)}-1}$ and
$b_1 = (1-\alpha)\alpha^{z^{(1)}-1}$. From (11), noting that the fraction is always greater than 1, it follows that
$a_2/b_2 > (1+\eta_n)^{V_n^+} > (1+\eta_n)^{z^{(1)}} = a_1/b_1$. We can thus add $z^{(1)}$ to the sum, and obtain
$(a_1 + a_2)/(b_1 + b_2) < a_2/b_2$, establishing the base case. Note that this also shows that
$(a_1 + a_2)/(b_1 + b_2) > a_1/b_1 = (1+\eta_n)^{z^{(1)}}$. We pass this property down by induction, and we can
assume this holds true at every step.

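To continue the induction at step $t$, let $a_2 = \sum_{x \in G^{(t-1)}}(1+\eta_n)^x(1-\alpha)\alpha^{x-1}$ and
$b_2 = \sum_{x \in G^{(t-1)}}(1-\alpha)\alpha^{x-1}$. As noted, we assume that $a_2/b_2 > (1+\eta_n)^{z^{(t-1)}}$ from the previous induction
step. We then choose the largest $z^{(t)} \in E_n \setminus G^{(t-1)}$, and we let $a_1 = (1+\eta_n)^{z^{(t)}}(1-\alpha)\alpha^{z^{(t)}-1}$
and $b_1 = (1-\alpha)\alpha^{z^{(t)}-1}$. Since $z^{(t)} < z^{(t-1)}$, it follows that $a_2/b_2 > (1+\eta_n)^{z^{(t-1)}} > (1+\eta_n)^{z^{(t)}} = a_1/b_1$.
We can thus add $z^{(t)}$ to the sum, and obtain $(a_1 + a_2)/(b_1 + b_2) < a_2/b_2$, as desired. Note
that this also shows that $(a_1 + a_2)/(b_1 + b_2) > a_1/b_1 = (1+\eta_n)^{z^{(t)}}$, and the induction is complete.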
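The conclusion of the induction can be illustrated numerically: for any set made of the full tail beyond $V_n^+$ plus some smaller missing values, the weighted average of $(1+\eta_n)^x$ under the geometric weights is no larger than the same average over the tail alone. The sketch below is an illustration under assumed parameter values, with a randomly chosen set of "missing" points below `v_plus` and a truncated tail; the function names are hypothetical.

```python
import random

def weighted_ratio(points, alpha, eta):
    """Ratio sum((1+eta)^x p(x)) / sum(p(x)) with p(x) = (1-alpha) alpha^(x-1)."""
    num = 0.0
    den = 0.0
    for x in points:
        p = (1 - alpha) * alpha ** (x - 1)
        num += (1 + eta) ** x * p
        den += p
    return num / den

def compare_with_tail(alpha, eta, v_plus, seed=0, tail_terms=500):
    """Return (ratio over holes + tail, ratio over the tail alone)."""
    rng = random.Random(seed)
    tail = list(range(v_plus + 1, v_plus + 1 + tail_terms))  # truncated tail
    holes = [x for x in range(1, v_plus) if rng.random() < 0.5]  # missing points below v_plus
    return weighted_ratio(holes + tail, alpha, eta), weighted_ratio(tail, alpha, eta)
```

Each added point $z < V_n^+$ contributes a ratio $(1+\eta_n)^z$ below the current average, so by the mediant property it can only pull the average down, which is what the first returned value reflects relative to the second.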

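By combining this result with the equivalent argument on the left side, we have effectively
shown that we can replace Equation (10) by

$$\frac{\sum_{x > V_n^-}(1+\eta_n)^{-x}(1-\alpha)\alpha^{x-1}}{\sum_{x > V_n^-}(1-\alpha)\alpha^{x-1}} \le \frac{\hat M_n}{M_n} \le \frac{\sum_{x > V_n^+}(1+\eta_n)^{x}(1-\alpha)\alpha^{x-1}}{\sum_{x > V_n^+}(1-\alpha)\alpha^{x-1}},$$

or equivalently by

$$(1+\eta_n)^{-V_n^-}\,\frac{(1-\alpha)(1+\eta_n)^{-1}}{1-\alpha(1+\eta_n)^{-1}} \le \frac{\hat M_n}{M_n} \le (1+\eta_n)^{V_n^+}\,\frac{(1-\alpha)(1+\eta_n)}{1-\alpha(1+\eta_n)}. \qquad (12)$$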


In Lemma 12 we have set: 


Pn — I^n/(1 ^n)) 


with 



|^ max{l,i^} 


1 - 


5n 


On the other hand, by Lemma 14, we have that:

$$V_n^+ < \log_{1/\alpha}(n) + 1 + \log_{1/\alpha}\frac{1}{(1-\alpha)\delta} \quad\text{and}\quad V_n^- > \log_{1/\alpha}(n) - \log_{1/\alpha}\frac{1}{(1-\alpha)\delta}.$$

It follows that both bounds of Equation (12) converge to 1, at the rate of roughly $\log(n)/\sqrt{n}$,
instead of the parametric rate $1/\sqrt{n}$. Regardless, for any desired $\eta > 0$, we get that there exists a
large enough $n$ beyond which, with probability greater than $1 - 2\delta$, we satisfy:

$$\frac{1}{1+\eta} < \frac{\hat M_n}{M_n} < 1 + \eta.$$

This establishes that $\hat M_n$ PAC-learns $M_n$, as desired.
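The convergence of the right-hand bound can be made explicit with a short worked step. The following is a sketch under an assumption that is only implicit in the text above: that the tolerance of Lemma 12 decays at the parametric rate, $\eta_n \le C/\sqrt{n}$ for some constant $C$ (a hypothetical name), combined with the localization $V_n^+ \le \log_{1/\alpha}(n) + c$ from Lemma 14.

```latex
% Assumptions: \eta_n \le C/\sqrt{n} for a constant C (hypothetical),
% and V_n^+ \le \log_{1/\alpha}(n) + c with c := 1 + \log_{1/\alpha}\frac{1}{(1-\alpha)\delta}.
\begin{align*}
(1+\eta_n)^{V_n^+}
  &\le \exp\left(\eta_n V_n^+\right)
   \le \exp\left(\frac{C\,(\log_{1/\alpha}(n) + c)}{\sqrt{n}}\right)
   \xrightarrow[n\to\infty]{} 1, \\
\frac{(1-\alpha)(1+\eta_n)}{1-\alpha(1+\eta_n)}
  &= 1 + \frac{\eta_n}{1-\alpha(1+\eta_n)}
   \xrightarrow[n\to\infty]{} 1.
\end{align*}
```

The first factor is the one responsible for the extra $\log(n)$ in the rate: the exponent $V_n^+$ grows logarithmically in $n$ while $\eta_n$ shrinks like $1/\sqrt{n}$.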